Fast algorithms for finding a minimum repetition representation of strings and trees

Authors

  • Atsuyoshi Nakamura
  • Tomoya Saito
  • Ichigaku Takigawa
  • Mineichi Kudo
  • Hiroshi Mamitsuka
Abstract

A string with many repetitions can be represented compactly by replacing h-fold contiguous repetitions of a string r with (r)^h. We present a compact representation, which we call a repetition representation (of a string) or RRS, by which a set of disjoint or nested tandem arrays can be compacted. In this paper, we study the problem of finding a minimum RRS or MRRS, where the size of an RRS is defined by the sum of the length of component letters and the description length of the component repetitions (·)^h, which is defined by w_R(h) using a repetition weight function w_R. We develop two dynamic programming-based algorithms to solve this problem: CMR, which works for any type of w_R, and CMR-C, which is faster but can be applied to a constant w_R only. CMR-C is an O(n^2 log n)-time O(n log n)-space algorithm, which is more efficient in both time and space than CMR by a ((log n)/n)-factor, where n is the length of the given string. The problem of finding an MRRS for a string can be extended to that of finding a minimum repetition representation (of a tree) or MRRT for a given labeled ordered tree. For this problem, we present two algorithms, CMRT and CMRT-C, by using CMR and CMR-C, respectively, as a subroutine. As well as the theoretical analysis, we confirmed the efficiency of the proposed algorithms by experiments, which consist of the following three parts. First, we demonstrated that CMR-C and CMRT-C are fast enough for large-scale data by using synthetic strings and trees, respectively. The size of an MRRS for a given string can be a measure of how compactly the string can be represented, meaning how well the string is structurally organized. This is also true of trees. To check such ability of MRRS-size, second, we measured the size of an MRRS for chromosomes of nine different species. We found that all the chromosomes of the same species have a similar compression rate when realized by an MRRS. Run length encoding (RLE) was also shown to have a species-specific compression rate, but species were separated more clearly by MRRS than by RLE. Third, we examined the size of an MRRT for web pages of world-leading companies by using the tag trees, showing a consistency between the compression rate by an MRRT and visual web page structures. © 2013 Elsevier B.V. All rights reserved.
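To make the size definition concrete, the following is a minimal sketch of a naive interval dynamic program that computes the size of a minimum repetition representation of a short string. It is not CMR or CMR-C as described in the paper: it runs in roughly cubic time or worse and is meant only to illustrate the objective (letters count 1 each, every repetition (·)^h adds w_R(h)). The function name `mrrs_size` and the default constant weight of 1 are assumptions made for this illustration.

```python
from typing import Callable


def mrrs_size(s: str, w_r: Callable[[int], float] = lambda h: 1) -> float:
    """Size of a minimum repetition representation of s (naive sketch).

    Assumption: w_r(h) is the weight charged for one repetition (.)^h;
    the default is a constant weight of 1, chosen only for illustration.
    """
    n = len(s)
    # cost[i][j] holds the minimum size for the substring s[i:j].
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == 1:
                cost[i][j] = 1  # a single letter has size 1
                continue
            # Option 1: concatenate representations of two shorter parts.
            best = min(cost[i][k] + cost[k][j] for k in range(i + 1, j))
            # Option 2: the whole substring is (r)^h for some period p.
            block = s[i:j]
            for p in range(1, length // 2 + 1):
                if length % p == 0 and block[:p] * (length // p) == block:
                    h = length // p
                    best = min(best, cost[i][i + p] + w_r(h))
            cost[i][j] = best
    return cost[0][n]


if __name__ == "__main__":
    for text in ["abab", "aaaa", "ababcababc"]:
        print(text, mrrs_size(text))
```

With the constant weight 1, "ababcababc" is best written as ((ab)^2 c)^2, which uses three letters and two repetitions. Passing a different `w_r`, for example `lambda h: 2`, changes which repetitions are worth introducing.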

Similar resources

Optimization of concrete structure mixture plan in marine environment using genetic algorithm

Today, due to the increasing development and importance of petroleum activities and marine transport, as well as the mining of the seabed, building activities such as the construction of docks, platforms and similar structures in coastal areas and oceans have increased significantly. Concrete strength, as one of the most important parameters for design, depends on many factors such as mixtu...


A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining

Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...


Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth

Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...


Edge sets: an effective evolutionary coding of spanning trees

The fundamental design choices in an evolutionary algorithm are its representation of candidate solutions and the operators that will act on that representation. We propose representing spanning trees in evolutionary algorithms for network design problems directly as sets of their edges, and we describe initialization, recombination, and mutation operators for this representation. The operators...


A New Heuristic Algorithm for Drawing Binary Trees within Arbitrary Polygons Based on Center of Gravity

Graphs are widely used in software engineering, networking and electrical engineering. In fact, graph drawing is a geometric representation of information. Among graphs, trees receive particular attention because of their ability in hierarchical extension as well as in processing VLSI circuits. Many algorithms have been proposed for drawing binary trees within polygons. However these algorithms generate b...


Journal:
  • Discrete Applied Mathematics

Volume 161

Publication date: 2013